From drug repositioning to target repositioning: prediction of therapeutic targets using genetically perturbed transcriptomic signatures

Abstract Motivation A critical element of drug development is the identification of therapeutic targets for diseases. However, the depletion of therapeutic targets is a serious problem. Results In this study, we propose the novel concept of target repositioning, an extension of the concept of drug repositioning, to predict new therapeutic targets for various diseases. Predictions were performed by a trans-disease analysis which integrated genetically perturbed transcriptomic signatures (knockdown of 4345 genes and overexpression of 3114 genes) and disease-specific gene transcriptomic signatures of 79 diseases. The trans-disease method, which takes into account similarities among diseases, enabled us to distinguish the inhibitory from activatory targets and to predict the therapeutic targetability of not only proteins with known target–disease associations but also orphan proteins without known associations. Our proposed method is expected to be useful for understanding the commonality of mechanisms among diseases and for therapeutic target identification in drug discovery. Availability and implementation Supplemental information and software are available at the following website [http://labo.bio.kyutech.ac.jp/~yamani/target_repositioning/]. Supplementary information Supplementary data are available at Bioinformatics online.


Alternative SNP profiling methods
We compared five different approaches using SNP information to select a baseline method for comparison with the proposed methods. We referred to these approaches as "Mean p-value method," "Gene based p-value method," "Mean NES method," "Min-max NES method," and "CADD method," respectively. The "Mean p-value method" used only SNPs within coding regions, while the other methods used SNPs within intergenic regions as well as coding regions. We obtained SNP data for the various diseases from the NHGRI-EBI Genome-wide Association Studies Catalog database (Buniello et al., 2019), and constructed disease-specific SNP profiles. Using the SNP profiles, we predicted therapeutic targets.
In the "Mean p-value method," we predicted therapeutic targets focusing on the p-values of SNPs within coding regions to predict therapeutic targets that can be directly controlled by drugs. When a gene had multiple SNPs or was reported by multiple GWASs, we averaged the p-values for the gene.
We constructed SNP profiles corresponding to the gold standard data by assigning the value 0 to genes with no SNP data. We used −log (') values as predictive scores. Genes that had SNPs with significantly strong associations with disease were considered to be candidate therapeutic targets.
In the "Gene based p-value method," we predicted therapeutic targets using weighted Fisher's statistic (Hou, 2005) as a meta-analysis approach. MAGMA (de Leeuw et al., 2015) was used as the GWAS meta-analysis tool. As input data, we used GWAS summary statistics for each disease obtained from the GWAS Catalog; MAGMA converted SNP-based p-values to gene-based p-values.
For the gene-based p-value, we calculated -log(p-value) and used it as a prediction score. We predicted genes with high scores as candidates for therapeutic targets of each disease.
In the "Mean NES method" and "Min-max NES method," we predicted therapeutic targets using effect size to express strength of biological effect. We obtained normalized effect size (NES) for each SNP from "GTEx_Analysis_v8_eQTL" data obtained from Genotype-Tissue Expression (GTEx; Ardlie et al., 2015). NESs of tissues associated with each disease were extracted. We examined two possible strategies for the NES-based prediction score: i) taking the average and ii) taking the minmax. In the former strategy, we averaged the NES values when multiple SNPs were present for each gene, and referred to it as "Mean NES method." In the latter strategy, we used the maximum and minimum values of NESs for inhibitory target prediction and activatory target prediction, respectively, and referred to it as "Min-max NES method." Genes with positive NES values should be inhibited for inhibitory target prediction, so we predicted genes with high NES values as candidate inhibitory targets. Genes with negative NES values were predicted to be activatory target candidates because genes with negative NES values should be activated.
In the "CADD method," we predicted therapeutic targets based on functional annotation. We used FUMA (Watanabe et al., 2017) to perform functional annotation on SNPs. GWAS summary statistics were used as input data for FUMA. SNPs were mapped onto genes based on their genomic location by positional mapping, which is the default function of FUMA. Since the therapeutic targets must be controlled by drugs, the mapped genes were protein-coding genes, which is the default setting of Rentzsch et al., 2019), which indicates the effect of the mutation on functions, was used as a prediction score. We predicted genes with high CADD scores as candidate therapeutic targets of each disease.

A formulation of trans-disease method
We address the problem of therapeutic indications prediction by focusing on therapeutic targets. Note that there are a number of candidates for diseases, and different diseases may have common characteristics in terms of molecular mechanisms. The same therapeutic targets are sometimes used for multiple diseases. Thus, we propose formulating the problem in the framework of supervised multiple label prediction.
Suppose that there are ) diseases and we are given * targets. We consider predicting which diseases would be treated by a target, that is, the +-th target. Each target is represented by a ,dimensional feature vector as -! in this study, where -! was obtained by averaging the multiple signatures from different cell lines.
We constructed a learning set of target-disease pairs that are pairs given in target-disease associations (see the Methods section for more details). There are ) candidates for diseases, and each target in the learning set is assigned a binary class label representing the . -th disease ( . = 1, 2, ⋯ , )). Let 4 ",! ∈ {0, 1} be the class label for the .-th disease assigned to the +-th target, where 4 ",! = 1 means that +-th target is used for the .-th disease, and 4 ",! = 0 means that the +-th target is not used for the .-th disease.
We construct a predictive model to predict whether the +-th target would be used for the .-th disease (. = 1, 2, ⋯ , )). Linear models are a useful tool to analyze extremely high-dimensional data for both prediction and feature extraction tasks. Thus, we adopt a linear function defined as 9 " = : " $ -, where : " is a ,-dimensional weight vector for the .-th disease. We represent a set of ) model weights by a , × ) matrix defined as < ∶= [: % , : & , ⋯ , : ' ] and estimate the weight matrix < by minimizing an objective function based on the learning set.
To overcome the scarcity of existing knowledge concerning relationships between targets and diseases, we propose learning individual predictive models 9 % , 9 & , ⋯ , 9 ' jointly, sharing information across ) diseases.
We attempt to estimate all of the weight vectors : % , : & , ⋯ , : ' jointly in the models by minimizing the logistic loss as follows: (1) properties. Thus, the optimization problem is written as follows: OPQ / @(<) + R(<). (2) Here we introduce two regularization terms. First, we use a standard ridge regularization term to avoid the over-fitting problem, which is defined as Second, we design another regularization term reflecting the similarities among diseases. In this study we evaluate the similarity among diseases using the cosine similarity and construct an ) × ) similarity matrix W for diseases in which each element X !,1 is a similarity score between the +-th and Y-th diseases. Then, we introduce the following regularization term: 3,- where ‖•‖ is the Euclidean norm, ^ is a diagonal matrix defined as K 4,4 ∶= ∑ X 4," has the effect of bringing the weight vectors : ! and : 1 close to each other if X 4," is high.
A grid search was performed to determine the optimal set of regularization parameters g 7 and g 8 .
We calculated the AUC scores by performing a five-fold cross-validation experiment. We selected parameter values that achieved the highest AUC score. We used the parameters to predict applicable diseases for the therapeutic targets.

Hierarchical clustering of perturbed genes
To examine the features of genetically perturbed gene expression signatures, we performed hierarchical clustering of both the genetically perturbed genes and the L1000 genes. Hierarchical clustering was performed using the Python library SciPy (v 1.4.1, https://www.scipy.org/) with the following parameters: method = "ward" and metric = "euclidean."

Performance comparison of alternative SNP profiling methods
We compared five different approaches using SNP information to select a baseline method for comparison with the proposed methods; "Mean p-value method," "Gene based p-value method," "Mean NES method," "Min-max NES method," and "CADD method" (see Supplementary Methods section).
Supplementary Fig. S1 compares the predictive performances of these different SNP profiling methods. There was little difference between these methods when predicting inhibitory targets, as shown in Supplementary Fig. S1 (A). However, in the case of activatory target prediction, the prediction accuracy of the Mean p-value method was slightly higher than those of other methods, as shown in Supplementary Fig. S1 (B). The Mean p-value method uses only SNPs in the coding region, while the other methods also use SNPs in the intergenic region, indicating that the focus on SNPs within coding regions is slightly more accurate in terms of therapeutic target prediction. Since the mean p-value method uses the averaged p-values of SNPs, while the Gene based p-value method uses gene-based p-values converted by meta-analysis with MAGM, suggesting that the focus on using the averaged p-values is slightly more accurate in terms of therapeutic target prediction. The mean pvalue method uses the significance of association, the Mean NES method and Min-max NES method use the NES score, which represents the strength of the biological effect, and the CADD method, which reflect the effect of the mutation on functions, suggesting that the using the significance of association is slightly more accurate in terms of therapeutic target prediction. These results suggest that the using the significance of association (averaged p-values) is slightly more accurate in terms of therapeutic target prediction.
Based on these results, we decided to use the slightly more accurate Mean p-value method for comparison with the proposed methods. In the main text, this method is referred to as the SNP profiling method.

Performance evaluation of therapeutic target predictions on a cell-by-cell basis
We compared predictive performance when using completed data vs. missing data. As shown in Fig.   4, the completion of genetically perturbed gene expression signature data improved the predictive performance of the methods, particularly when using the trans-disease method (P = 5.01 × 10 5: for inhibitory target-disease pairs; P = 1.09 × 10 5& for activatory target-disease pairs; Wilcoxon signed-rank tests). In all cases, the trans-disease method with completed data outperformed the inverse signature method with missing data (P = 1.47 × 10 5: for inhibitory target-disease pairs; P = 2.53 × 10 5; for activatory target-disease pairs). These results indicate that the trans-disease method with completed data is the most effective method for predicting therapeutic targets for diseases.
We evaluated the ability of the inverse signature method to predict therapeutic targets on a cellby-cell basis. Supplementary Fig. S4 shows a comparison of the AUC scores for each cell line and the associated missing rate in the gene expression signatures for each cell line. Supplementary Fig.   S4 (A) and (B) show evaluations of the inhibitory target prediction and activatory target prediction performances, respectively. For nearly all cell lines, the AUC scores for both the inhibitory and activatory target predictions were about 0.5, which shows that the predictions obtained using the inverse signature method were random inferences. This issue may have arisen because the inverse correlation method involves unsupervised learning. Because the number of known therapeutic targetdisease associations is very limited, the accuracy of this method may be underestimated.
When using the trans-disease method, we evaluated the performance of therapeutic target predictions by performing five-fold cross-validation experiments. We calculated the AUC scores using the method described int the Results section. Supplementary Fig. S5 shows a comparison of the AUC scores for each cell line and the associated missing rate in the gene expression signatures for the same cell lines. Supplementary Fig. S5 (A) shows a performance evaluation for the inhibitory target prediction. For nearly all cell lines, inhibitory target predictions by the trans-disease method were superior to the predictions produced using the inverse signature method. In cell lines with high missing rates, the performance of predictions was poor. Supplementary Fig. S5 (B) shows a performance evaluation for the activatory target prediction. Performances differed depending on the cell line, which suggests that activatory target prediction by the trans-disease method performed well in certain cell lines. Overall, these results indicate that the trans-disease method outperformed the inverse signature method when predicting therapeutic targets. However, in cell lines with high rates of missing data, the trans-disease method was not effective for inhibitory target predictions in particular.
Finally, we compared the predictive performances of the methods for each cell line.

Performance evaluation of data completion method
Supplementary

Supplementary Discussion
Toward the appropriate selection of disease-related cell lines We evaluated the performance of our proposed methods on a cell-by-cell basis (Supplementary Results and Figs. S4-S6). It would be ideal to use only cellular data associated with disease organs; unfortunately, such data associated disease organs is not always available. In fact, the cellular data in this study was limited to cell lines because the cellular data for all diseases studied here was not available. However, we predicted new therapeutic target-disease associations using the proposed method based on the averaged cell line data. Note that our proposed method is applicable to new predictions on a cell-by-cell basis when genetically perturbed gene expression profiles are available for all cell lines that correspond to diseased organs. Accumulation of more data should solve this problem in the future.

Supplementary References
Acar       were performed based on differentially expressed genes of gene expression signatures following IFNG over-expression. The x axis represents the ratio of associated genes to all genes in the term or pathway. The numbers next to the bars represent the number of genes associated with the term or pathway. The color represents the GO or pathway groups. Asterisks indicate the significance level as follows: * " < 0.05, * * ( < 0.01. The functionally grouped network of these GO terms or pathways are shown in Fig. 7 (B).
Supplementary Figure S9. (A) GO, (B) pathway, and (C) PPI network results for TAF1B as an inhibitory target. These analyses were performed using the genes differentially expressed following TAF1B knockdown. (A) The analysis of GO biological process terms and (B) KEGG pathway analysis, with circles representing GO terms and KEGG pathways, respectively. The edges denote term-term interactions and functional groups (GO groups) based on the genes shared between the terms. The node color represents the GO group. The node size represents the term significance; the biggest term from a group is the most significant and is highlighted. The GO terms and pathways in these networks are shown in Supplementary Figs. S10 and S11. (C) PPI network showing the protein (gene)-protein (gene) interaction network altered in response to TAF1B knockdown, as well as the results of network topology analysis. The circles in the network represent proteins. The edges denote protein-protein interactions. The node size represents degree. The edge size represents interaction confidence defined according to STRING.  Figure S10. Distribution of enriched GO terms in GO analysis of TAF1B predicted as the inhibitory target. The bar chart shows the result of GO analysis of the biological process. GO analysis was performed based on differentially expressed genes of gene expression signatures following TAF1B knock-down. The x axis represents the ratio of associated genes to all genes in the term. The color represents the GO groups. Since the number of enriched GO terms was too large, the top 3 GO terms with high significance in each GO group were displayed. The functionally grouped network of these GO terms is shown in Supplementary Fig. 9 (A).

24
Supplementary Figure S11. Distribution of enriched pathways in the pathway analysis of TAF1B predicted as the inhibitory target. The bar chart shows the result of KEGG pathway analysis of the biological process. This analysis was performed based on differentially expressed genes of gene expression signatures following TAF1B knock-down. The x axis represents the ratio of associated genes to all genes in the pathway. The color represents the KEGG pathway groups. The functionally grouped network of these KEGG pathway terms is shown in Supplementary Fig. 9 (B).    , 'S100A1', 'S100P', 'S1PR2',  'TCF3', 'TCFL5',   'TESK1', 'TFCP2', 'TGFB1', 'THAP11', 'TIMELESS', 'TIMP2', 'TLR2', 'TLR9', 'TRIM16', 'TRIM38', 'TRRAP', 'TXN', 'TXNRD1', 'TYRO3', ' Table S4. Leaf nodes belonging to the clusters formed by the hierarchical clustering for overexpressed genes. 'Cluster node id' corresponds to the cluster node ids in the Supplementary Fig. S2 (B). The nodes of the clusters are listed in the 'Overexpressed genes' column. The number of over-expressed genes belonging to the clusters are shown in the 'Count' column represents. Supplementary Table S5. The distribution of inhibitory targets repositioned from the original disease class to other disease classes. Each element represents the number of inhibitory targets repositioned. The rows indicate the original ICD disease chapters, and the number in parentheses represents the number of known inhibitory targets belonging to the disease chapters. The columns indicate the newly predicted ICD disease chapters. The chapters are as follows: Chapter I: certain infectious or parasitic dis-eases; Chapter II: neoplasms; Chapter III: diseases of the blood or blood-forming organs; Chapter IV: diseases of the immune system; Chapter V: endocrine, nutritional, or metabolic diseases; Chapter VI: mental, behavioral, or neurodevelopmental disorders; Chapter VII: sleep-wake disorders; Chapter VIII: diseases of the nervous system; Chapter IX: diseases of the visual system; Chapter X: diseases of the ear or mastoid process; Chapter XI: diseases of the circulatory system; Chapter XII: diseases of the respiratory system; Chapter XIII: diseases of the digestive system; Chapter XIV: diseases of the skin; Chapter XV: diseases of the musculoskeletal system or connective tissue; Chapter XVI: diseases of the genitourinary system; Chapter XVII: conditions related to sexual health; Chapter XVIII: pregnancy, childbirth, or the puerperium; Chapter XIX: certain conditions originating in the perinatal period; Chapter XX: developmental anomalies; Chapter XXI: symptoms, signs, or clinical findings, not else-where classified; and Chapter XXII: injury, poisoning, or certain other consequences of external causes. A network visualizing this matrix is shown in Fig. 5

(A).
ICD chapter (# of targets)  I  II  III  IV  V  VI  VII  VIII IX  X  XI  XII  XIII XIV XV  XVI XVII XVIII XIX Table S6. The distribution of activatory targets repositioned from the original disease class to other disease classes. Each element represents the number of activatory targets repositioned. The rows indicate the original ICD disease chapters, and the number in parentheses represents the number of known activatory targets belonging to the disease chapters. The columns indicate the newly predicted ICD disease chapters. The chapters are as follows: Chapter I: certain infectious or parasitic dis-eases; Chapter II: neoplasms; Chapter III: diseases of the blood or blood-forming organs; Chapter IV: diseases of the immune system; Chapter V: endocrine, nutritional, or metabolic diseases; Chapter VI: mental, behavioral, or neurodevelopmental disorders; Chapter VII: sleep-wake disorders; Chapter VIII: diseases of the nervous system; Chapter IX: diseases of the visual system; Chapter X: diseases of the ear or mastoid process; Chapter XI: diseases of the circulatory system; Chapter XII: diseases of the respiratory system; Chapter XIII: diseases of the digestive system; Chapter XIV: diseases of the skin; Chapter XV: diseases of the musculoskeletal system or connective tissue; Chapter XVI: diseases of the genitourinary system; Chapter XVII: conditions related to sexual health; Chapter XVIII: pregnancy, childbirth, or the puerperium; Chapter XIX: certain conditions originating in the perinatal period; Chapter XX: developmental anomalies; Chapter XXI: symptoms, signs, or clinical findings, not else-where classified; and Chapter XXII: injury, poisoning, or certain other consequences of external causes. A network visualizing this matrix is shown in Fig. 5 (B).
ICD chapter (# of targets)  I  II  III  IV  V  VI  VII  VIII IX  X  XI  XII  XIII XIV XV  XVI XVII XVIII XIX